TST: measure and log individual test durations in CI #677
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff            @@
##           master     #677       +/-   ##
===========================================
- Coverage   62.28%   45.19%   -17.09%
===========================================
  Files          11       87       +76
  Lines         928    16835    +15907
===========================================
+ Hits          578     7608     +7030
- Misses        350     9227     +8877

View full report in Codecov by Sentry.
I completely missed that there were two separate calls to pytest, and of course I'm measuring the wrong one. Will update when back near a keyboard.
I'm just noticing now that
I see the longest-running tests take ~30 and ~15 min respectively. This is extremely long; CI jobs can usually complete well under a minute, so I'm puzzled.
They do genuinely take this long, and I'm aware it's far from ideal. I've been wanting to improve this for a while, and my approach has been to gradually speed up the functions themselves. This has had some success: the tests used to run for almost 3 hours. The tests spend the vast majority of their time in two functions, fm.firstguess and fm.mcmc_negfc_sampling. These are important functions, but there might be a way to make the tests less intensive while still simulating their real use.

As for the Python versions, I'm open to ideas. Testing each version has actually been helpful, since it often reveals dependency issues and deprecated code; just installing the packages and importing VIP doesn't always trigger an error. Do test suites usually target a single Python version? One option is to run the tests on 3.13, and have the earlier versions just install and import VIP to see if anything pops up.

As for what the tests are doing, I believe Valentin has some small example data that he provides to most of the functions several times to test each use case.
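One possible direction for the "less intensive while still simulating real use" idea is to drive the expensive sampler with a deliberately tiny budget in CI. The sketch below is only an illustration, not part of this PR: the keyword names (`nwalkers`, `niteration_limit`) and the `example_*` fixtures are assumptions that would need to be checked against the actual `fm.mcmc_negfc_sampling` signature and Valentin's example data.

```python
# test_negfc_budget.py -- hypothetical sketch, not part of this PR.
# Exercise the same code path as a full run, but with a tiny sampling
# budget in CI. The keyword names below are assumed, not VIP's verified API.
import os

from vip_hci import fm

# Shrink the budget when running under CI (GitHub Actions sets CI=true).
TINY = os.environ.get("CI") == "true"
MCMC_KWARGS = (
    {"nwalkers": 20, "niteration_limit": 50}      # assumed kwargs, CI budget
    if TINY
    else {"nwalkers": 100, "niteration_limit": 500}  # fuller local budget
)

def test_mcmc_negfc_sampling_smoke(example_cube, example_angles, example_psf):
    # example_* are hypothetical fixtures standing in for the small
    # example data mentioned above.
    result = fm.mcmc_negfc_sampling(
        example_cube, example_angles, example_psf, **MCMC_KWARGS
    )
    assert result is not None  # smoke check: the sampler ran to completion
```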
Yes, running most of your tests against multiple Python versions is extremely valuable. What I'm proposing is to exclude a selection of the suite from the normal jobs and run that same subset against a single Python version, to save resources (and possibly some wall time as well).
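One standard way to implement such a split (a sketch, not from this PR; it follows the marker pattern from pytest's own documentation) is to tag the expensive tests with a `slow` marker and deselect them by default:

```python
# conftest.py -- sketch of a "slow" marker that is skipped unless
# explicitly requested (adapted from the pytest documentation).
import pytest

def pytest_configure(config):
    # Register the marker so pytest doesn't warn about it.
    config.addinivalue_line("markers", "slow: marks tests as slow")

def pytest_addoption(parser):
    parser.addoption(
        "--runslow", action="store_true", default=False,
        help="run tests marked as slow",
    )

def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

The full Python-version matrix would then run plain `pytest` (slow tests skipped), while a single extra job on one version runs `pytest --runslow`.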
I'm okay with this; we'll see what Valentin thinks.
Other than that, regarding performance, have you considered rewriting performance-critical loops in a low-level language (C, C++, Rust, or even Cython)?
@neutrinoceros thanks for all of this! Just going to piggy-back off this most recent PR and ask: is there an order in which to merge these PRs? Or should we just accept the ones that are ready and leave the drafts?
Yes, there's an order in which my PRs are intended to be reviewed and merged. I'm very careful to only undraft the ones that have no dependencies, so please prioritize everything explicitly marked as ready for review (that is, not drafted). If more than one is ready at any given point, they should be independent, and the order doesn't matter.
Hi @neutrinoceros, thanks for raising this. As @IainHammond explained, we have put some effort into reducing the tests' computing time, although we have started to hit a barrier in trying to (i) keep them meaningful and (ii) cover as many parts of the code at the same time as possible (to minimize the overhead of partly re-running the same preliminary code in different tests).

That being said, I dived into the longest remaining tests, and I think there is a way to reduce the longest one by ~40% without reducing the coverage or the meaningfulness, as explained below. Thanks, by the way, for highlighting these with the proposed change (which I think should also be fine to keep in the PR, unless you see a reason not to?).

In essence, for now the

To further reduce the length of the 2 longest tests (
That's okay with me! I initially used a pytest plugin to measure durations, but I didn't want to add another dependency long term. Now that I've realized pytest has this feature natively, I don't mind if you guys want to merge the change as is.
I'll give it a spin!
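For reference, the native feature referred to here is pytest's `--durations` reporting: `pytest --durations=20` prints the 20 slowest setup/call/teardown phases, and `--durations=0` prints all of them. If finer control were ever needed, a tiny conftest.py hook could log every test's call-phase duration without adding a dependency; a minimal sketch, not part of this PR:

```python
# conftest.py -- minimal sketch: print each test's call-phase duration
# using a standard pytest hook.
def pytest_runtest_logreport(report):
    # The hook fires for the setup, call and teardown phases; only the
    # call phase reflects the test body itself.
    if report.when == "call":
        print(f"{report.nodeid}: {report.duration:.2f}s")
```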
I get the noble intention, but I find this approach quite counter-productive long term. I'll try to explain my perspective here (and to keep it brief :)):

Anyway, these are very general considerations; I'm not making a judgment on VIP specifically, nor have I studied it deeply enough to see where the low-hanging fruit would be.
force-pushed from a100fe5 to a60148b
Thanks for your explanations @neutrinoceros. I agree with the idea of testing as little as possible in each chunk. I think the test suite grew somewhat organically in VIP; in some cases we saw the opportunity to incorporate very short bits of code leveraging the outputs of another tested algorithm, which involved less hassle than designing new tests.
This is a trial run, not intended for merging (at least not yet), as I'm curious to see exactly which test(s) take so long that CI requires about 2 hours.